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Extended Abstract. The variance reduction established by importance sampling strongly de- 
pends on the choice of the importance samphng distribution. A good choice is often hard 
to achieve especially for high-dimensional integration problems. Nonparametric estimation 
of the optimal importance sampling distribution (known as nonparametric importance sam- 
pling) is a reasonable alternative to parametric approaches. In this article nonparametric 
variants of both the self-normalized and the unnormalized importance sampling estimator are 
proposed and investigated. A common critique on nonparametric importance sampling is the 
increased computational burden compared to parametric methods. We solve this problem to 
a large degree by utilizing the linear blend frequency polygon estimator instead of a kernel 
estimator. Mean square error convergence properties are investigated leading to recommen- 
dations for the efficient application of nonparametric importance sampling. Particularly, we 
show that nonparametric importance sampling asymptotically attains optimal importance 
sampling variance. The efficiency of nonparametric importance sampling algorithms heavily 
relies on the computational efficiency of the employed nonparametric estimator. The linear 
blend frequency polygon outperforms kernel estimators in terms of certain criteria such as 
efficient sampling and evaluation. Furthermore, it is compatible with the inversion method 
for sample generation. This allows to combine our algorithms with other variance reduction 
techniques such as stratified samphng. Empirical evidence for the usefulness of the suggested 
algorithms is obtained by means of three benchmark integration problems. As an application 
we estimate the distribution of the queue length of a spam filter queueing system based on 
real data. 
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1. INTRODUCTION 



Importance Sampling (IS) is a general sampling technique for approximating expectations 

EpM ^I<p^ (^(x)p(x)dx 



of some function : M*^ ^ M with respect to a probability density function p on M'^. It 
is often applied if direct sampling from distribution p is computationally too demanding 
or intractable. But IS is not hmited to this purpose. Unless (p is constant, IS can often 
yield massive reduction of the estimators variance if applied carefully. Formally importance 
samphng is a change of measure. The expectation Ep[ip] is rewritten as 



where q is the probability density function of an importance samphng distribution (also 
known as proposal) and w{x) — p(x)/g(x) the Radon-Nikodym derivative. The proposal 
needs to be chosen so that its support includes the support of \(j)\p or p, which imposes a 
first constraint on q. Using importance samphng the integral can be estimated by 



where {x^, . . . ,x^} are drawn from proposal q. 

In Bayesian inference, it is often the case that either p or the proposal q (or both) are only 
known up to some constant. In this case an alternative is the self-normalized importance 
samphng (SIS) estimator given by 



to the expectation I^p if it is finite. However, this result is neither of help for assessing 
the precision of the estimators for a finite set of samples nor for the rate of convergence. In 
order to construct error bounds it is desirable to have a central limit theorem (CLT) at hand. 
Under the assumptions that and Yaiq[(pw] are finite a central limit theorem guaranties 
^/N {P^ - 1^) =^ A/'(0, cTj^g) where aj^ = 'E>q[pw - I^Y (Rubinstein 1981). The proposal which 
minimizes the variance cr^g is given by 
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k(x)b(x) 
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/|(^(x)|p(x)dx' 
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q^^ is called the optimal proposal. A CLT for the self-normalised IS estimator /^^^ can be 
estabhshed y/N{I^^^ — I^) =^ A/'(0, cr|jg) with limiting variance cr^jg = Eq[{(f — Iip)w]'^ under 
the additional assumption that Varq[w] < oo (Geweke 1989). Variance (jgjg is minimized by 
the proposal 

^SIS/^N _ |9?(x) -/^|p(x) 

^ ^ /|(^(x)-/,b(x)rfx' ^ > 

provided that the median of with respect to p exists. The optimal proposals Q and ^ are 

merely of conceptual help as the computation of their denominators is typically at least as 

difficult as the original integration problem. Hence, the objective is to find an easy-to-sample 

density that approximates the optimal proposals. Traditionally, a proposal is chosen from 

some parametric family of densities {q^,e,0 E 6} that satisfy the assumptions of the central 

limit theorems or some related conditions. Typically, it is demanded that the support of 

q^^o includes the support of \ip\p or \ip — respectively, and that the tails of q do not 

decay faster than those of \(-p\p. Many different density classes have been investigated in the 

literature including multivariate Student t, mixture, and exponential family distributions (see 

for instance Geweke 1989; Stadler and Roy 1993; Oh and Berger 1993). The parametrized 

choice of the proposal can be adaptively revised during the importance sampling which is 

known as adaptive IS (Oh and Berger 1992; Kollman et al. 1999). Often expectation I^p 

needs to be computed for many different functions if leading to different optimal proposals. 

As a consequence, it is necessary to investigate the structure of any new ip in order to find 

a suitable parametric family. 

A reasonable alternative that does not rely on prior investigation of the structure of 
the integrand is nonparametric importance sampling (NIS). Nonparametric approximations 
based on kernel estimators for the construction of proposals have been used before (West 
1992, 1993; Givens and Raftery 1996; Kim et al. 2000). Under restrictive conditions it has 
been shown that nonparametric (unnormalized) IS can not only reduce the variance of the 
estimator but may also improve its rate of convergence of the mean square error (MSE) to 
Q(^]\[-id+^)/{d+'i)^ (Zhang 1996). Except for special cases, parametric importance samphng 
strategies achieve the standard MC rate of 0{N^^), as the optimal proposal is typically not 
included in the employed distribution family. There is still a lack of theoretical results for 
NIS, particularly for the self-normalized importance sampler. Furthermore, computationally 
aspects, that critically effect the performance of NIS, have only been insufficiently treated 
in the literature (Zlochin and Baram 2002). 

The competitiveness of NIS compared to parametric IS heavily relies on the computational 
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efficiency of the employed nonparametric estimator. This article introduces NIS algorithms 
based on a multivariate frequency polygon estimator which we show to be computationally 
superior to kernel estimators. Furthermore, our nonparametric estimator allows the combi- 
nation of NIS with other variance reduction techniques (such as stratified samphng) which 
is another advantage over kernel estimators. We investigate NIS not only for IS but also for 
SIS which has not been done before. Under loose conditions on the integrand, the MSE con- 
vergence properties of the proposed algorithms arc explored. The theoretical findings result 
in distinct suggestions for efficient application of NIS. The large potential of NIS to reduce 
MC variance is verified empirically by means of different integration problems. Overall, we 
provide strong evidence that our NIS algorithms solve well-known problems of existing NIS 
techniques. This suggests that NIS is a promising alternative to parametric IS in practical 
apphcations. 

The remainder of the paper is organized as follows. In Sections 2 and 3 we propose NIS 
algorithms for IS and SIS, respectively, and investigate their MSE convergence properties. 
In Section 4 we discuss the applicability of the suggested algorithms including parameter 
selection and implementation issues. Finally, in Section 5 and 6 wc present simulation 
results for three toy integration problems and for a spam filter queueing system based on 
real data. 

2. NONPARAMETRIC IMPORTANCE SAMPLING 

A NIS algorithm based on a kernel density estimator, that approximates the analytically 
unavailable optimal proposal q^^, is considered in Zhang (1996). Theoretical and empirical 
evidence of the usefulness of this approach has been established. In particular, it was proved 
that NIS may yield MSE convergence of order (9(A^~(''+®)/('^+^)) essentially under the very 
restrictive assumption that (pp has compact support on which (p is strictly positive. The 
theoretical results derived in this paper get by with much weaker assumptions. From a 
practical point of view a kernel density estimator is computationally too demanding. For the 
purpose of NIS it does not suffice that the employed nonparametric estimator provides a fast 
and accurate approximation of the distribution of interest. It is also required to allow efficient 
sampling as well as fast evaluation at arbitrary points. As a computationally more efficient 
alternative to the kernel estimator, it is suggested to use a histogram estimator (Zhang 1996). 
The drawback of a histogram is its slow convergence rate of (!?(A'"~^/(^+'^)) compared to kernel 
estimators, which typically achieve C)(A^"^/(^+'^)). In this paper we propose the usage of a 
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multivariate frequency polygon which is known as linear blend frequency polygon (LBFP) 
(Terrell 1983 cited in Scott 1992, p. 106). It is constructed by interpolation of histogram bin 
mid-points. Being computationally only sUghtly more expensive than ordinary histograms, 
it achieves the same convergence rate as standard kernel estimators. Consider a multivariate 
histogram estimator with bin height for bin Bki^..,^kd — Y[i=i[^ki — h/2,tki + h/2) 

where h is the bin width and {t^^, . . . ,tk^) the bin mid-point. For x G HiLiI^fci' ^fc, + h) the 
LBFP estimator is defined as 



h,-,jde{o,i} 
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(3) 



It can be shown that / integrates to one. 

Our NIS algorithm consists of two steps. In the first step the optimal proposal is 
estimated nonparametrically using samples drawn from a trial distribution and weighted 
according to the importance ratio q!^ /qq- In the second step an ordinary importance sampfing 
is carried out, subject to the proposal estimated in the first step. Before we can state the 
algorithm we need to introduce the following quantities. Let Am be an increasing sequence 
of compact sets defined by Am = {x G M"' : g^^(x) > cm}, where cm > and cm as 
M goes to infinity. For any function g we denote the restriction of g on Am by gM and we 
abbreviate Qm = q^^j^^- Furthermore, the volume of Am is denoted by Vm- Note that, by 
definition. Am converges to the support of q^^. The theorems in this section consider the 
following algorithm, which is related to the NIS algorithm in Zhang (1996). 

Algorithm 1 - Nonparametric Importance Sampling 

Step 1: Proposal estimation 



For j — 1, . . . , M: Sample x^ ~ go- 

Obtain estimate g^(x) = ^ig±^l^^(x), 

where uJm = 1/-^E^i'^m> = kM(x-^)|p(x-')go(x^)~\ and 



M 

X 



i=l 



h 



{tki,tki+h) 
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ioj:xeYlt=i[tk,:h, + h). 
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Step 2: Importance Sampling 

• For i = 1, . . . , N — M: Generate sample x* from proposal g^. 
. Evaluate 1^1^ = {N - M)-' ^i/' ^M{^')p{^')qf,{^T'- 

Both Am and 6m are required in the proofs of the Theorems below but they can be omitted 
in practice. 

Assumption 1 Both (p and p have three continuous and square integrable derivatives on 
suppdv^lp) and \ip\p is bounded. Furthermore, J(V^\ip\p)'^{\ip\p)~^ < oo where V^lv^lp = 
d'^\(p\p/dxl + . . . + d'^\(p\p/dx% 
Assumption 2 E[|(y9|pgQ "'^j'* is finite on suppdv^lp). 

Assumption 3 As total sample size — oo, bin width h satisfies h ^ and Mh'^ oo. 
Additionally, we have 5m > 0, Vm5m = o{h'^) and M^{Vm5mY oo. 

Assumption 4a cm guaranties ^'^+(^^^ ) — = ) — ^ h +{Mh ) ^ 

Assumption 5a cm guaranties (/ gjf l{gis<CM})^ = o{M~^h'^ + {M'^h'^y'^). 

For fixed sample size M and conditional on the samples {x^,x^, . . . ,x*^} it is not hard to 
show that I^J^ is an unbiased estimator with variance 

For the special case > we have = i^MpI^l, and ^ can be rewritten as 



N-MJ gll(x) 

Under the foregoing assumptions we now prove that the variance (jsj attains convergence 
rate (9(A^~('^+^)/('^+^)), if bin width h is chosen optimally. 

Theorem 1. Suppose Assumptions 1-3, ^a, 5a hold, > 0, and q = q^. We obtain 



and the optimal bin width 



h* = — ^— M-d+4 
V4i/i3'^ ) 

whcTC 

49 ^ /-(W^l^ fdhd]q ^ _ f q 



Hi = — > / ^^^^ + — > / H.^ , . 

2880 ^7 q 64^7 q J q^ 
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Proof. See Appendix A. 

A direct implication of Theorem [T] is the following corollary. 

Corollary 2. Under the assumptions of Theorem^ and the further assumption that M/N 
A (0 < A < 1), and h = h* we yield 

1 2 



d+S 

lim Nd+^E 



jNIS _ J 



A-d+4(i _ xy^ X I^D 



and optimal proportion A* = 4/((i + 8), 

where D = {(rf/4) 4/(^^+4) + (rf/4)-'^/(^+4)} [fff (2'^3-'^ff2)'] '^^''^'^ 

We remark that under much stronger assumptions corresponding results for NIS based on 
kernel estimators were obtained in Zhang (1996). 

We now move to a more general case. Assume > (and < 0) does not hold. For 
this case we show that the NIS algorithm asymptotically achieves the minimum importance 
sampling variance. By substituting the optimal IS distribution q^^ into variance CTj^g and 
writing shorthand I^p = J |<^(x)|p(x)(ix, we see the optimal variance of the IS estimator to 
be ll - II- 

Assumption 4b cm guaranties ^'^^j^^s — = o( ^ ) — \ ^^^^ h +{Nih ) ^ 

Assumption 5b cm guaranties (/ g^^ljgis^c^,^})^ = o{M^'^h'^ + {M'^h'^y'^). 

Theorem 3. Suppose that Assumptions 1-3, 4b, 5b hold, (p does not have a definite sign, 
and q = q^. Then we obtain 



T^\fNIS 7-12 1 



<PM s^j N - M 

and the optimal bin width 



h** = 



wh 



ere H, = - (j f^^p + J f,^), H, = (j f^-2j J and f. 



•£P _ \f\P 

lip I to 



Proof. See Appendix A. 

As a consequence of Theorem [3| the NIS algorithm does not lead to a MSE rate improve- 
ment for functions (p which take positive and negative values. But if the optimal bin width 
h** is used, we have 



6 



That is, the optimal IS variance is achieved asymptotically. Unhke Theorem [T], the optimal 
proportion A cannot be computed analytically due to its dependency on N. But theoretically 
it can be computed as A** = argmin;^G'(A^, /i**, A) where G = E[/^]^^ - J^]^. Clearly, A** 
decreases in N. Note, that for the optimal asymptotic variance to be achieved, it suffices 
that < A < 1. 

Corollary [2] and Theorem |3] suggest that IS based Monte Carlo integration can be much 
more efficient for functions > (and f < 0) than for arbitrary functions. This stems 
from the fact that for non-negative (non-positive) functions the usage of the optimal pro- 
posal leads to a zero variance estimator. By approximating the optimal proposal with a 
consistent estimator it is therefore not surprising that the standard MC rate can be sur- 
mounted. Consequently, it should be reasonable to decompose ip into positive and negative 
part, {p = ip^ — ip" , and apply Algorithm 1 to ip~^ and (p~ separately. Since then, we can 
expect to achieve the superior rate C'(A^"(^+8)/{°'+4))_ ]\jote that the partitioning of (f needs 
not to be done analytically. It may be carried out implicitly in Step 1 of the algorithm. This 
approach, denoted by NIS+/-, is investigated in a simulation study in Section 5. 

3. NONPARAMETRIC SELF-NORMALIZED IMPORTANCE SAMPLING 

Many problems in Bayesian inference can be written as the expectation of some function 
of interest, ip, with respect to the posterior distribution p which is only known up to some 
constant. This leads to the evaluation of integrals 

/V?(x)p(x)c;x 
'^^^ = /p(x)rfx ' 

where p = ap with unknown constant a. Self-normalized IS is a standard approach for 
solving such problems. It is often suggested to choose the proposal close to the posterior. 
But from the CLT we know that one can do better by choosing it close to the optimal proposal 
which is proportional to \ip — I^p\p. Next, we introduce a nonparametric self-normahzed IS 
(NSIS) algorithm. 

In analogy to the definition of Am we define Ajv/ = {x G M'^ : g^^^(x) > cm}, where cm > 
and Cm ^ as M goes to infinity. Its volume is denoted by Vm- 

Algorithm 2 - Nonparametric Self-Normalized Importance Sampling 

Step 1: Proposal estimation 

• For j = 1, . . . , M: Sample x-' ~ go- 
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Obtain estimate gfl^fx) = /^(xj+^M /^n 



~j ~j 



where ujm = '^/MY,j=i^M^ = I'^m^^^) - b(^^)'?o(x^) \ /m(x) analogous to 
Algorithm 1, and 



^tep 2: Self-Normalized Importance Sampling 

• For i = 1, . . . , N — M: Generate sample x* from proposal gfp. 

• Evaluate 

Vm ^N-M ~ , j\ 



where wm(x*) = p(x*)g|P(x*) ^ 



Both the SIS and NSIS estimator produce biased estimates. But, however, the estimators 
are asymptotically unbiased. Under Assumptions 1-3 (with p, \ip\, cm, Vm replaced by p, 
\ip — /<^|, Cm, Vm) it is easy to verify that, conditional on the samples {x^,x^, . . . ,x*^}, the 
CLT of Geweke (1989) holds for The asymptotic variance of the CLT can be written 

as 



2 _ t2 



(6) 



. ^ / (g!}^(x)-gif(x)) 



with I^p^j = J |y9A/(x) — /<^^Jp(x)(ix the median of ip. Consequently, /^^^ is the (asymptoti- 
cally) optimal variance that can be achieved by self-normalized importance sampling. Unless 
(p is constant, it is impossible to build up a zero variance estimator based on SIS. This ren- 
ders it unnecessary to investigate separately the MSE convergence of NSIS for non-negative 
and arbitrary functions. 

The structure of (jgjg is very similar to the structure of the variance in ^ but the weights 
uj^^ introduce inter-sample dependencies which make the reasoning in the proofs of Theorem 
[T]and Theorem [3] not directly apphcable. However, similarly to Theorem [3] we can show that 
the NSIS asymptotically attains optimal variance for certain bin width h and proportion 
< A < 1. 

Theorem 4. Suppose that Assumptions 1-3, 4(i, 5a (with p, \ip\, cm, Vm replaced by p, 
\ip — Cm, Vm ) hold, and q = q^^^. Then we obtain 



T^lfNSIS T 12 



VM ^\ N - M 



1 + h^Hi + — -H2 



X (1+0(1)) 
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and the optimal bin width 



dH22'' 



1 

i+4 



1 



where Hi and H2 are defined as in Theorem\n (with replaced by q^^^). 



Proof. See Appendix A. 

First, note that analogous to Theorem [3| there is no analytic solution for the optimal A. 
Second, the theorem imphes that with NSIS the MSE rate cannot be improved. Therefore, 
NSIS is (at least asymptotically) less efficient than NIS+/-. There is consequently no reason 
to apply NSIS in cases where NIS+/- is apphcable. However, this does not impair the 
usefulness of NSIS in cases where normalization is required due to unknown constants. 

4. APPLYING NONPARAMETRIC IMPORTANCE SAMPLING 

In this section we discuss what is required for implementing NIS/NSIS. First, one need to take 
care of the selection of go? and A. Second, an implementation of the LBFP estimator which 
allows the generation of samples is required. Given these ingredients the implementation of 
Algorithm 1 and 2 is straightforward. 

4.1 Parameter Selection 

(i) From a practical point of view trial distribution go should be chosen such that its support 
is close to the support of \(p\p ov \(p — respectively, and such that it has heavier tails 
than the corresponding optimal proposal. But it is not required that go emulates any struc- 
ture of the optimal proposal. Obviously, the choice should also comply with Assumption 
2. Note that the expectations in the assumptions may not exist, if go is too close to the 
optimal proposals. In addition, it is important to choose an easy-to-sample density. For 
low-dimensional problems, even a uniform distribution may suffice. 

(ii) As the optimal bin width incorporates unknown quantities dependent on the optimal 
proposal, it typically cannot be computed analytically. The unknown quantities can be 
estimated using the plug-in method based on the samples of Step 1 of the algorithms, as 
suggested in Zhang (1996). If the second derivative of the optimal proposal is unknown, the 
plug-in method cannot be apphed. In this case, a Gaussian reference rule is an alternative, 
(in) Except for the case investigated in Theorem [T] and Corollary [2| where the optimal 
proportion A* is given by a beautifully easy expression only depending on the problem 
dimension, it is not clear how to choose A. However, from the MSE error expressions in the 
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theorems we know that A* (from Corollary [2]) serves as an upper bound. Empirical evidence 
suggests that A should never exceed .25. 

(iv) In practical applications the restriction of the estimator on a compact set Am can be 
omitted as the induced bias can be made arbitrarily small and particularly smaller than the 
desired precision of the integral value. Hence, the sequence cm does not need to be defined. 
Sequence 6m can also be skipped in practice as mentioned before. 

4.2 Implementing the LBFP estimator 

The implementation of the LBFP estimator / should take into account efficient sampling 
and evaluation. Given the multivariate histogram defined through bin heights /l^^ the 
implementation of the evaluation of / is simple (see We emphasize that for storing / 
on a computer it suffices to store the underlying histogram. Sampling from a LBFP is more 
involved than evaluation and to the author's knowledge this has not been discussed in the 
literature until now. We propose to apply the inversion method. The crucial fact is that a 
LBFP can be written as a product of (conditional) univariate frequency polygons (FP) 

d 

/(x)=/'^p(xi)nf^(^.i^i.-i) 

1=2 

with {a;i:j__i} = {xi, . . . , Xj.i}. This representation suggests to produce draws from / by 
samphng recursively from the univariate FPs /^^ using their inverse cumulative distribution 
functions. A FP is a convenient object as it is just a hnear interpolated univariate histogram. 
Furthermore we have 

/ {Xi\xi.,i^i) = -J- (7) 

where f{xi;i) are (marginalized) LBFPs, i = 1, . . . , (i. We will see below that the f^^ {xi\xi;i^i) 
are not required itself but their cumulative distribution functions F(xj|xi:j_i). As FPs are 
piecewise hnear functions and due to relation ([t]) the latter are obtained without difficulty 
provided that LBFPs f{xi;i) can be evaluated. Hence it is required to calculate the marginal- 
ized histograms underlying the LBFPs f{xi..i). These are specified through bins -Bfci,...,^^ and 
bin heights 

Let y = {yi, . . . e [0, 1)"' and ?/i e [F{tk^\xi.,i-i), F{tk,+i\xi.,i-i)). We now describe how 
the inverse cumulative distribution functions F~^(-|xi:j_i) of /^^(xi|xi;j-_i) can be evaluated 
at Hi by making use of F(xj|xi:j_i). It is easy to see that, for Xi G [tfc^,tfc,+i), j {xi\xi,i^i) 
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is a linear function with intercept a and slope (3 where 

f {Xi,,^l, tkj 1 /(Xl:i_l,tfc^ + l) 

a = — ^ and P = 7" 

f{Xl:i-l) " [ f{Xl:i-l) 

Hence is the solution of the quadratic equation 

Ui- F{tk^\xi;i-i) = I P^{xi\xi;i-i)dxi = az +^z'^ (8) 

which is given by 



I [(72 - yi)tk, + (z/i - 7i)4,+i] / (72 - 7i) for /3 = 0, 

where 71 = F(tfcja;i:i_i) and 72 = 

Summarizing, a sample x-' from the LBFP / is obtained through the following recursion. 
Let be a sample from the uniform distribution on [0, 1)'^. Then, for i = 1, . . . , 

1. Compute the marginahzed histogram associated with LBFP f{xi.,i). 

2. Calculate cumulative distribution function F{xi\x^^.^_-^ (or F{xi) for i = 1) at the 
(marginal) bin mid points t^. using ([T]). 

3. Evaluate x{ = F~^{yi\x{.i_^) (or x{ = F~^{y() for i = 1) using (|9). 

We remark that for generating N samples Step 1 needs only to be carried out once as it is 
independent of the particular sample x-'. Our C++ implementation of the LBFP is available 
on request. 

4.3 Computational Remarks 

Now the computational complexity of the LBFP is discussed. For h = h* it can be shown 
that the complexity for generating N samples from a LBFP is of order 0{2'^(PN^''-^^^^^'^^^^) 
(see Appendix B for details). The complexity of evaluation is of lower order. Compared 
to crude MC which has 0{dN) sampling from a LBFP is only shghtly more expensive for 
d small. For kernel estimators sampling and evaluation is of order 0{dN'^) (Zlochin and 
Baram 2002) proving that the LBFP is computationally more efficient for all relevant d and 
N. Note, more efficient sampling from kernel estimates is possible using regularization with 
whitening (see for instance Musso et al. 2001). But this can induce severe bias if the target 
distribution is non-Gaussian. 
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Besides asymptotic complexity properties there are other computational aspects which are 
of relevance in practice. On computer systems, the evaluation of functions such as exp and 
pow is known to be much more expensive than standard arithmetic operations. Contrary 
to most parametric IS approaches, nonparametric IS methods do not require calls to those 
functions. 

5. SIMULATIONS 

We consider three toy examples in order to test our nonparametric procedures against (para- 
metric) alternatives. The first two examples are designed to evaluate certain properties of 
the NIS algorithm and to demonstrate the degraded performance of the NSIS algorithm. 
The third example is a two-dimensional benchmark problem for self-normalized importance 
samphng. 

A reasonable measure for the effect of a variance reduction technique is the relative effi- 
ciency (RE). It is defined as the ratio of the crude MC MSE to the MSE of the method of 
interest. In the case that both estimators are unbiased, the RE is also known as variance 
reduction factor. The performance of the different algorithms will be measured by RE and 
computation time. In all examples the simulation is done for sample sizes = 1,000, A^ = 
5,000, and A^ = 10,000. All computation were carried out on a Dell Precision PWS390, Intel 
CPU 2.66GHz, and the algorithms are coded in C++. For pseudo random number generation 
we used the Mersenne Twister 19937 (Matsumoto and Nishimura 1998). All computation 
times are reported in milliseconds. 

Example 1. As our first example we consider a simple integrand that is to be integrated 
with respect to the standard normal distribution of dimension d. The integrand is defined by 
(y9(x) = i]d(x). It takes positive and negative values on the d-dimensional unit cube. 

This allows the evaluation of the strategy to apply Algorithm 1 separately to the positive 
and negative part of the integrand (NIS+/-). In our simulation d varies from 1 to 8. The 
trial distribution go is set to the uniform distribution on [—1, 1]"' and the bin width h is 
chosen with the plug-in method. A is set to .15 and to the optimal value 4/9 for NIS and 
NIS+/-, respectively. In order to obtain comparable results, for NIS+/- the total sample 
size is equally spread to the integration of the positive and negative part. 

Table [T] shows the RE and computation times for MC, NIS, NIS+/-, and ordinary IS 
(subject to the uniform distribution on [— l,!]"^). The RE figures for NIS+/- report large 
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variance reduction which is present at least up to dimension d = 8. Even if we take com- 
putation time into account, we find significant efficiency improvement: For instance, for 
d = A and = 10,000 we obtain RE of 22 whereas the computation time surplus factor is 
about 7. Also note, that IS becomes more favorable as d increases. In order to investigate 
the computationally efficiency we plotted MSE x computation time (Figure [T]). Contrary 
to RE, smaller values are favourable. We observe that the critical dimension, up to which 
NIS+/- is computationally more efficient than the other methods, strongly depends on the 
magnitude of A^. Whereas for = 1,000 one would prefer NIS+/- to IS only for d = 1, 
for = 10,000 one would do so up to c? = 4. Finally, the convergence of the NIS variance 
towards the optimal IS variance is examined. The minimum IS variance — is approx- 
imately .098 and 0.0099 for d = 1,4, respectively. In Figure [2] the estimated variances of 
NISx(l — A)A^ are plotted for 100 < N < 2,500. The plots indicate rapid convergence to the 
optimal values. For comparison: the variance of crude MCxA^ is roughly .198 (for d = 1) 
and .063 (for = 4). 

Example 2. This example is concerned with the pricing of a call option within the Black- 
Scholes model. Given interest rate r and volatility a the evolution of a stock is described 
by the stochastic differential equation (SDE) dS{t)/S{t) = rdt + adW{t) with standard 
Brownian motion W. The solution of the SDE is given by S{T) = S{0) exp[(r — 0.5a^)T + 
aVrZ] where Z is a standard normal random variable. At time T, the call option pays the 
amount {S{T) — K)^ depending on the strike level K. The price of the option at time is 
given by the expectation E[F(Z)] of the discounted payoff F{Z) = exp{—rT){S{T) — K)^ . 
That is, the pricing problem reduces to the integration of a payoff function with respect to 
the standard normal distribution. Parametric IS is a standard variance reduction technique 
for option pricing. A shifted standard normal distribution is often used as proposal. This 
approach is known as change of drift technique. In our simple model the (asymptotically) 
optimal drift is given by argmax^ log F(z) — .5z^ (Glasserman et al. 1999). We state the 
simulation results for the optimal change of drift IS (GDIS) as parametric benchmark. 

For our simulation we set 5(0) = 100, r = .1, cr = .2, T = 1. The option price is 
estimated for the strikes Ki = 90 and K2 = 130. For Ki the option is said to be in the 
money {Ki < S{0)) where for K2 it is called out-of-the money {K2 > S{0)). The latter case 
is particularly suited for IS techniques, as crude MG fails to satisfactorily sample into the 
domain that affects the option price, go is set to the uniform distribution on [—5, 5] and bin 
width h is selected using the plug-in method. A is set to the optimal value 4/9 for the NIS 
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and to .05 for NSIS. 

The efficiency improvements of the IS methods relative to crude Monte Carlo integra- 
tion (RE) are shown in Figure |3] Whereas parametric IS methods and NSIS yield constant 
reduction factors, NIS reahzes increasing relative efficiency which coincides with its theoret- 
ical superior convergence rate. Particularly for the out-of-the money scenario, NIS achieves 
massive variance reduction. Establishing only shght variance reduction NSIS is worst. This 
confirms our recommendation to avoid NSIS where NIS is apphcable. Figure |4] shows the pro- 
posals employed in the simulation for strike K2. The optimal IS proposal is single-moded and 
can be reasonably approximated by some Gaussian distribution. This explains the satisfying 
performance of IS methods based on Gaussian proposals reported in the literature. How- 
ever, NIS significantly outperforms GDIS. For more complex payoffs implying multimodal 
optimal proposals, the advantage of NIS should be even more pronounced. Gomputation 
times for different sample sizes are reported in Table [2] First, notice that GDIS is much 
more expensive than MG due to massive evaluation of the exp function whilst computing 
the hkehhood ratios. Second, the computational burden of NIS increases sub-hnearly for our 
sample sizes. This is due to initial computation for the LBFP, which is roughly independent 
of A^. Remarkably, NIS is computationally cheaper than GDIS for = 10,000. 
Example 3. The last example is a two-dimensional benchmark integration problem dis- 
cussed in Givens and Raftery (1996). The density of interest p{xi^X2) is given by Xi ~ 
W[-l,4] and X2IX1 ~ A^dXil, .09a^). We investigate the cases a = .75 and a = 3.5. This 
kind of density also occurs in work on whale modehng (Raftery 1995). Small values for a im- 
ply a strong nonlinear dependency between Xi and X2. As a becomes larger the dependency 
vanishes in favor of a more diffuse relationship (see Figure [5]). Following Givens and Raftery 
(1996), we use this scenario for comparing self-normalized IS algorithms. NSIS is tested 
against SIS with proposal equal to the uniform distribution on [—4,7] x [—4,8]. The same 
uniform distribution is used as trial distribution go in the NSIS algorithm. We compute the 
expectation of functions ipi{xi,X2) = X2 and (p2{xi,X2) = l{xi<o}(a^i? 2^2)- The parameters of 
NSIS are set as follows: A = .2 and /i = 1.54, 1.224, 1.09 (for A^ = 1,250, 5,000, 10,000). For 
comparison, we also state the results of two other nonparametric algorithms, namely GAIS 
and LAIS (West 1992; Givens and Raftery 1996). GAIS and LAIS are adaptive nonpara- 
metric IS methods, that estimate distribution p with adaptive envelope refinements based 
on nonparamtric kernel estimators. Density p and the optimal SIS proposals are shown in 
Figure [5| They are rather far away from the initial guess go- Table [3] shows the relative 
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efficiency of NSIS, GAIS, and LAIS with respect to SIS for the two functions and the two 
different values of a. The figures for GAIS and LAIS were reprocessed from Givens and 
Raftery (1996). For N = 5,000, NSIS is clearly the method of choice. 

6. APPLICATION 

We investigate a spam filter queueing systems with real data. Queueing system are an active 
field of research (see for instance Lazowska 1984; Asmussen 2003). Numerous applications are 
readily available. The most basic queueing system, denoted briefiy by M|M|1, consists of a 
single server and a single waiting room (with infinite capacity). The interarrival and service 
times of the jobs are exponential distributed with parameter /i and u, respectively. This 
model is well understood theoretically but usually too restrictive for real world applications. 
In our case, e-mails arrive at a spam filter that decides whether or not a particular e-mail 
is spam. The data consist of interarrival times ti (in seconds) and service times Si (in 
milliseconds) for n = 22, 248 e-mails. The data were recorded between 8am and 8pm on 8 
business days in September 2008 and are available on request. (We are grateful to J. Kunkel 
for providing the data.) The system that produced the data is a single queue, dual server 
system, i.e. the e-mails are processed by two parallel spam filter threads. In the following 
we investigate both the single and the dual server case. The empirical distributions of the 
interarrival and service times are displayed in Figure |6] We observe that the former is well 
approximated by an exponential distribution with parameter ft = n/J2^=iti = -074 (which 
is the maximum likelihood estimate). On the contrary, for the service time distribution it 
is hard to find a parametric model. Therefore we employ a LBFP estimate. (Note that 
a kernel estimator is inappropriate as heavy sampling from the service time distribution 
is required.) The bin width was selected with the Gaussian reference rule for frequency 
polygons h = 2.15crn~^^^ (Terrell and Scott 1985) with a being the standard deviation of 
the service times Sj. 

We are interested in the probability that the queue length reaches a certain level K. 
This is a typical problem in queueing systems with rare events being of particular interest. 
Importance sampling is a standard variance reduction technique for this task (see for instance 
Glynn and Iglehart 1989; Glasserman and Kou 1995; Kim et al. 2000). For estimating the 
probabilities we simulate N busy periods and count the number of periods in which level K 
was reached. A busy period begins when an e-mail has arrived in an empty system and ends 
when either the system is empty again or the queue length has reached level K. Let Ui be 
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the sample path of the queue length in the ith busy period resulting from samples and 
yf drawn from the interarrival distribution pt and service time distribution ps, respectively. 
In the dual server case represent the service times of both servers. The MC estimate of 
the probability of interest is Ik = '^/NJ2iLiV{^i) where ip{uji) = 1 if cjj reaches K and 
else. Assume the number of e-mails that have been served in the ith busy period is Lj. Then 
there must he K + — 1 arrivals in this period for the queue to reach level K. (Note, a 
busy period starts with one job in the queue.) Hence, if importance sampling is used the 
estimator becomes 



with likelihood ratio 



i=l 



,=1 qti^) Qsiyf) 

and proposals qt, Qs- Here NIS works as follows: We simulate M busy periods by samphng 
interarrival times and service times yf from trial distributions go,t and go,s, respectively, 
and obtain sample paths cjj, i = 1, . . . , M. Let J = {i G {1, . . . , M}, (^{uoi) = 1}. For esti- 
mation of the optimal proposals we use those times x^, yf with i e2. The interarrival time 
proposal qt is estimated parametrically by using an exponential distribution with parameter 

K+L^-l K+Li~l 

A = E E ^^/E E (10) 

i£j j = l i£T j = l 

where wj = pt(x^)/go,t(x^). The service time proposal is estimated nonparametrically (as 
in Algorithm 1) based on samples yf and weights = Ps{yt) / (lo,s{y^)-, i 

For our simulation we set iV = 1 Mio. , A = .15, and the trial distribution go,s equal to 
the LBFP estimate of the service distribution. For M|M|1 systems it is well known that 
(asymptotically) optimal proposals are achieved by swapping the parameters /i and u. For 
this reason go,* is set to the Exponential distribution with parameter i) = n/ Yll=i = 0.147. 
As parametric IS benchmark we consider the IS scheme that carries out IS for the interarrival 



times only. It uses the Exponential distribution with parameter /t defined in (10) as proposal. 

We compare MC, IS, and NIS in terms of the coefficient of variation (CV) and RE. The 
former is defined as the ratio of the standard deviation to the mean of the probability 
estimate. Note that for CV smaller values are favourable. The results are summarized in 
Tables [4] and m Where no figure is given, the MC estimator was zero. We find that as the 
event of interest becomes rarer NIS becomes more favourable. This holds for both the single 
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and dual server case. The NIS probability estimates for different queue levels K are shown 
in Figure [7j No error bounds are given as they are very small for the large number of busy 
periods used. 

Real-world queueing systems typically involve complicated distributions such as the service 
time distribution in our case. Therefore, it is often impossible to set up parametric IS schemes 
for simulation. Here, NIS has a distinct advantage. The extension of NIS to the recently 
proposed state-dependent IS schemes for queueing systems is part of our current research. 

7. CONCLUDING REMARKS 

Contrary to other articles on nonparametric importance sampling, we favored the LBFP in- 
stead of kernel estimators. As shown in Section 4, draws from a LBFP can be generated using 
the inversion method. As the inversion method is a monotone transformation, it preserves 
the structure of the pre-sampled uniformly distributed variates. This offers the opportunity 
to combine NIS/NSIS with other variance reduction techniques such as stratified sampling, 
moment matching, and quasi MC techniques (Robert and Casella 2004; Glasserman 2004). 

In financial engineering and many other fields, integration problems are often high-dimensional. 
Due to the curse of dimensionality and increasing computational complexity, the direct ap- 
plication of NIS is intractable for large dimensions. However, dimension reduction techniques 
such as principal component analysis, the Brownian Bridge, or the screening method can be 
apphed to break down the required integration task to moderate dimensions (Glasserman 
2004; Rubinstein 2007). 

Furthermore, we emphasize that the LBFP estimator is not restricted to the usage within 
nonparametric importance sampling. It is a reasonable alternative to other nonparametric 
estimators whenever sampling and evaluation is required. 

APPENDIX A 

Proof of Theorem [l]. We denote and briefiy by qm and qm- Since for > we 
have qm = '^Mpl^l,, the variance a\j of J^]^ (conditional on {x\ . . . is given by 

(A^-M).L = 4,/ '^"W-^;'W)' dx. (11) 



In order to get rid of gA^(x) in the denominator we write 



N-M 



(gM(x) - gAf(x)) 



2 



-fix 



Km + Rm- 



E 



(gAf (x) - gM(x)) 
gA/(x)gM(x) 



-fix 
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The discrepancy between qm and qm can be investigated by 

gA,/(x) - gjv/(x) : 



/a/(x) - UMQui?^) ^ 5a/(1 - VMQMi?^)) 



+ Vm^M 



+ 



/m(x) - UJMqMi^) 



'■fM 



'■'Pm 



[12) 



It will be established below that Y^[WM{^)f = 0{h^ + {Mh'^Y^). Now we show that 
E[f/{^(x) + f/|^(x)]^ is of lower order. Under Assumptions 1, 2 we yield 



< C{VmSm)' + C\ E 



'■'PM 



M 



c 



M 



1/2 



1/2 



The last inequality follows analogously to Lemma 1, 2 in Zhang (1996). Since by Assumption 
3 VMhi = o{h^) and M^VmSm)^ ^ oo, we obtain E[t/l,(x) + f/|,(x)]2 = o(E[W^m(x)]2). 
We conclude Km ^ J E[WM{^f]qM{^)d^. 

It is not hard to work out that / E[VrM(x)^]g^/(x)(ix decomposes into an integrated 
squared bias term Li and an integrated variance term L2 



E[A,(x)/-M-gM(x))^ 



(ix 



Var[/M(x)/,-^] ^^ ^ = + + 0(M-^) 

gM(x) 



For notational convenience the following is only shown for d = 1. Without loss of generality 
we assume x G [—h/2, h/2). Then ful^lj simplifies to 



/m(x) _ fh/2-^\ 



h 



+ 



V2 + x\ fl 



h 



cUH 



(13) 



where /q"" = 1/M ^ -=1 ^If l[-h,o)(i^) and /™ = 1/M ^ -li cc'l,l[o,h)(i^) are the heights of 
bins [—h,0), respectively, [0,h) of an unnormalized histogram. For the computation of Li 
we need to compare the Taylor expansions of E[/a/(x)J~2^] and qm which are given by 

E[/m(x)/;^^] = gM(0)+xg;,(0) + /iM,(0)/6 + O(/i3), 
gAf(x) = gM(0)+xg;,(0)+xX/(0)/2 + O(/i3). 



The former follows from (13) and from the expansion of the histogram 

E[/o"/!/;i] = gM(0) -/+ /^g;,(0)/2 + eq'U0)/Q + 0{h'). 
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Thus we obtain (E[/m(x)/-^i ]-gM(x))2 ^ {h^-3^'^fqlj{0f/36. Integration over [-h/2, h/2) 
and using Taylor expansion of l/gM(x) about leads to 



36 J_h/2 qui^) 2880 gM(0) ^ ^ ^ ' 

By summing over all bins and applying standard Riemann approximation we yield 

2880 J QMi^) ^ ' ^ ' 



Next we derive an approximation to L2. From (13) we have 



Var[/M(x)/- ] = (^) Var[r /- ] + (^) Var[/-/- 

+ ^^^^^cov[r/-,/r/.-^ 



/iV 2-2x^ 

cUHr-n ^ <1m{0? 9M(0)!w.--n 1 o^^ n.^,.r/UH r-l /UHr-l 



In addition, it can be shown that Varf/UHj-^J ^ _^|Ayi__^ fo^ 2 = 0, 1 and Cov[/oUHJ-i , /uhj^^^^ 



gM(o) 

M 



analogously to Scott (1992, chap. 4). That is we yield 

Var|Mx)/;;,l^(^ + -gl)^.0(M-.,. 



Analogously to (14) and (|15[) we then obtain 



V2 <;«(x) 3M*(0) 



and 



/ ^dx + 0(M-i) 
i ?o(x) 



SM/i 7 go(x) 

respectively. Very similar computations in the multivariate case yield 

2'^ 

Km ^ h'HM,i + ^^,Hm,2 

where 

It remains to show that Rm is negligible compared to Km- The construction of qm implies 
dM > Sm{^m + Vivf^A,/)^ > 0. Under Assumption 1 and 4a it can be shown that Rm is 
negligible following the same lines as in Zhang (1996). 



19 



The proof is finished by noting that the squared bias term in E[/^]^ — J^]^ is neghgible due 
to Assumption 5a and that the expressions J^^^ (in (11 )), Hm,i^ and Hm,2 can be substituted 
by their unrestricted counterparts as their diff'erences are of lower order. 

Proof of Theorem |3[ Again qm is shorthand for g]^. Let /^^, = (^g^ - kM^y Straight 
forward calculations yield 



[N - M)aM = I^,, / ( -J J + qM - Qm ] Qm 



r 



2 

^ qMQM J Qm J Qm J Qm 



J — + qM - qm 1 Qm 



Term T4 is independent of the nonparametric estimation and we have I'^^T4^ = I^j^^ — I"^^- 
The expectation of term Ti can be written as 

2 E[gM - Qm] f ,2 ^[Qm - QmY , f 1-2 E[gAf - QAif 



fvM 2 / fvM 3 ^ / f-fiM — ^1,1 + ^1,2 + ^1,3- 

^ qii J tm J ImIm 

Similar expressions are obtained for quantities T2 and T3. We begin with Ti 1. Analogously 
to (12) we conclude gM(x) — gA/(x) ~ — [/m(x) — ^I;^/g^/(x)]//(pJ^^. From the proof of Theorem 
[T] we also know that 

E[/m(x) - coMqMi^)/!^,,] = E[A/(x)/;^\] - gM(x) = {h' - 3x2)g^,(0)/6 + 0{h') 
for (i = 1 and x G [— /?^/2, h/2). Then we obtain 

using a Taylor expansion of /^j,^(x)^/gj\/(x)^ about 0. Finally, summing over all bins and 
using Riemann approximation gives Ti 1 in the one-dimensional case 

In the multivariate case we yield 

Term Ti_2 can be treated analogously to J E[WM{^y]q^i{x)dx in the proof of Theorem [ij 
We end up with 



/.M(x)^^^rfx + 0(/.^). 



Tl,2 



2 



2880 ^7^-- ql, +QA^J^^^^ qf, 
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Comparing the term in brackets to Ti i we observe that the former is negligible. Furthermore, 
similarly to Rm in the proof of Theorem [Tj it follows that T13, is negligible compared to Ti 2 
provided that Assumption 4 holds. 

The calculations for T2 and T3 are very similar to those of Ti and therefore omitted. 
Putting all terms together we obtain 



2 V^gM f r V^9M 



qoqM 



X (l + o(l)) + (/ 



We observe that the terms restricted on M can be substituted by their asymptotic limits, 
which completes the proof. 



Proof of Theorem 



We denote g|P and briefly by qm and qm- As the bias of /, 



NSIS 



is asymptotically negligible we have E[/J^/s - = {N - M)-^E[al^s] x {1 + o(l)}. Thus, 
it suffices to examine E[(Tgjg] with agjg as in We obtain analogously to (12) 

gM(x) - gM(x) = ~ + + UmW- 



al. 



The crucial step for proving that the remainder term f/|^(x) + Ulj{'x) is of lower order is to 
show that under Assumptions 1, 2 



E[a/^„ -cJm]"' < CM 

for Z = 1,2 (compare Lemma 1, 2 in Zhang (1996)). We have 

. 1 ,^ 



" X] l'^A^(x^) - /^,,|p(i^)go(x^) ^ 
1 . 



and by applying the Minkowski inequality we obtain 



\2l 



< E 



«^^M - ~ V/IP(*^)^0(X-')"^ 



n 2i\ 21 



+ C E[V,-/, 



12/ 



C ^ ^^-1/2^) _ 
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Hence, we conclude that the remainder term is of lower order. Finally, we need to show that 

The main difference to Theorem [l] is the dependency of the weights uj^. Define = 
\(Pm{^^) — I^M\pi^'^)^o{'^^y^ , j = ^, ■ ■ ■ , M. As in the proof of Theorem [Ij let /^^ and 
/(y^ be unnormalized histogram bins based on weights ujI,j and u;;^^, respectively. It is not 
hard to show that E[/UH(a4j-i] = E[/UH(a/^^J~i] + 0(M-i/2) and Var[/UH(a4j-i] = 

Var[/^^(Q;J^^J~^] + C(M~^). The rest of the proof follows analogously to Theorem [Ij since 
weights are independent and the additional 0(M^^/^), 0{M^^) terms are negligible. 

APPENDIX B 

Let Bm be the number of bins. It follows that the number of bins in each marginal space 
is 0{B]1/). We begin with the analysis of the evaluation of a LBFP. Given location x 
we need to find the associated bin mid-points (t^^, . . . , tfc^) which is of order 0{dB]lf'). 
Then formula (jsj) can be evaluated which is 0{2'^d). Now observe Bm ~ VmI^'^ and h* = 
C(p(c/)i/('^+4)A^^i/(^+^)) with p{d) = d{2/3Y. By assuming that h = h* we obtain 0{dB]^/ + 
2'^d) ^ 0{p{dY^/'''^^^'^dN'^/'^'^^^^ + 2'^d) neglecting the slowly increasing sequence Vm- 

Sampling from a LBFP consists of the three steps described in Section 4.2. In Step 
1 the marginalized histograms corresponding to the LBFP i = 1, . . . ,d — 1, need 

to be calculated. This can be done recursively in 0{Bm)- In the second step F are to 
be computed at all bin mid-points tk^ using relation ([t]). Thus it is required to evalu- 
ate tfcj, i = 1,. . . ,d. This consists of searching the bin mid-points (t^i, • • • ,^fci_i) 
associated with and evaluating formula ^ as we have discussed above. It is suf- 
ficient to do the former once. Thus we end up with 0{dB]// + 2'^d x dB]j/') where the 
latter dB]lf' is due to the evaluation of F at all tfc- in each marginal dimension. Step 3 
has complexity 0{dB]lf') as in each marginal dimension the bin mid-point t^i satisfying ?/, G 
[F(tfc.|a;i:j__i), must be found. Putting all together we yield 0{BM+2''-d'^B]/f) 
the generating one sample. As above we assume h = h*^ substitute Bm ~ VM/h'^-, and omit 
Vm in order to derive 0{p{d)-'^/^'^+^'^N'^/^'^+^'^ +2'^d'^p{dy^l^'^+^'^N^I'^'^+^^). As Step 1 needs to 
be carried out only once and as p{dY^/'''^^^^ is small compared to 2'^d'^ we obtain approxi- 
mately 0{2'^d'^N^'^'^^'>/^'^'^'^^) for generating N samples. Finally, we remark that evaluations 
are neghgible compared to generating N samples. 
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/a/(x) - UMQui^) 



al. 



rfx ^ h'^Hi + 



Ho 



REFERENCES 



Asmussen, S. (2003), Applied Probability and Queues, New York: Springer. 

Geweke, J. (1989), "Bayesian Inference in Econometric Models using Monte Carlo Integra- 
tion," in Econometrica, 57, 1317-1339. 

Givens, G. H., and Raftcry, A. E. (1996), "Local Adaptive Importance Sampling for Multi- 
variate Densities With Strong Nonlinear Relationships," in Journal of American Sta- 
tistical Association, 91, 132-141. 

Glasscrman, P., and Kou, S.-G. (1995), "Analysis of an Importance Sampling Estimator 
for Tandem Queues," in ACM Transactions on Modeling and Computer Simulation, 5, 
22-42. 

Glasserman, P., Heidelberger, P., and Shahabuddin, P. (1999), "Asymptotically optimal 
importance sampling and stratification for pricing path-dependent options," in Math- 
ematical Finance, 9, 117-152. 

Glasserman, P. (2004), Monte Carlo Methods in Financial Engineering, New York: Springer. 

Glynn, P. W., and Iglehart, D. L. (1989), "Importance Samphngfor Stochastic Simulations," 
in Management Science, 35, 1367-1392. 

Kim, Y. B., Roh, D. S., and Lee, M. Y. (2000), "Nonparametric Adaptive Importance 
Sampling For Rare Event Simulation," in Winter Simulation Conference Proceedings, 
Vol. 1, 767-772. 

Kollman, C, Baggerly, K., Cox, D., and Picard, R. (1999), "Bayesian Inference in Econo- 
metric Models using Monte Carlo Integration," in Annals of Applied Probability, 9, 
391-412. 

Lazowska, E. D. (1984), Quantitative System Performance, Computer System Analysis 
Using Queuing Network Models, Prentice Hall. 

Matsumoto, M., and Nishimura, T. (1998), "Mersenne Twister: A 623-Dimensionally Equidis- 
tributed Uniform Pscudo- Random Number Generator," in ACM Transactions on Mod- 
eling and Computer Simulations, 8, 3-30. 

Musso, M., Oudjane, N., and Le Gland, F. (2001), "Improving Regularised Particle Filters," 
in Sequential Monte Carlo Methods in Practice, eds. A. Doucet, N. de Freitas and N. 
Gordon, New York: Springer. 

Oh, M. S., and Berger, J. (1992), "Adaptive Importance Samphng in Monte Carlo Integra- 
tion," in Journal of Statistical Computation and Simulation, 41, 143-168. 

— (1993), "Integration of Multimodal Functions by Monte Carlo Importance Sampling," in 
Journal of American Statistical Association, 88, 450-456. 



23 



Raftcry, A. E., Givcns, G. H., and Zch, J. E. (1995), "Inference from a Deterministic 
Population Dynamics Model for Bowhead Whales," in Journal of American Statistical 
Association, 90, 402-430. 

Robert, C. P., and Casella, G. (2004), Monte Carlo Statistical Methods, New York: Springer. 

Rubinstein, R. Y. (1981), Simulation and the Monte Carlo Method, New York: Wiley. 

Rubinstein, R. Y. (2007), "How to Deal with the Curse of Dimensionality of Likelihood 
Ratios in Monte Carlo Simulation," unpubhshed manuscript. 

Scott, D. W. (1992), Multivariate Density Estimation, New York: Wiley. 

Stadler, J. S., and Roy, S. (1993), "Adaptive Importance Sampling," in IEEE journal on 
selected areas in communications, 11, 309-316. 

Terrell, G. R., and Scott, D. W. (1985), "Oversmoothed Nonparametric Density Estimates," 
in Journal of American Statistical Association, 80, 209-214. 

West, M. (1992), "Modelling with Mixtures," in Bayesian Statistics 4, eds. J.M. Bernardo 
et al., Oxford UK: Oxford University Press, 503-524. 

— (1993), "Approximating Posterior Distributions by Mixtures," in Journal of Royal Sta- 
tistical Society, 55, 409-422. 

Zhang, P. (1996), "Nonparametric Importance Sampling," in Journal of American Statistical 
Association, 91, 1245-1253. 

Zlochin, M., and Baram, Y. (2002), "Efficient nonparametric importance sampling for 
Bayesian learning," in Neural Networks 2002, 2498-2502. 



TABLES AND FIGURES 



24 







AT 1 nnn 
iv = i,UUU 




iV = 0,UUU 




iV = iU,UUU 




iVlcLllUU. 


il 




_L iniG 




TimP 




TimP 


MC 


1 


1.0 


1 


1.0 


9 


1.0 


16 


IS 


1 


1.5 


3 


1.8 


16 


1.6 


31 


NIS 


1 


1.6 


13 


1.8 


28 


1.7 


50 


TVTTC 1 / 

NIS+/- 


1 


25.0 


13 


57.3 


24 


51.3 


40 


MC 


4 


1.0 


4 


1.0 


20 


1.0 


45 


IS 


4 


5.0 


7 


5.2 


38 


4.2 


80 


NIS 


4 


3.1 


112 


4.0 


234 


3.8 


408 


JNlh+/- 


4 


9.1 


105 


26.0 


195 


22.0 


326 


MC 


8 


1.0 


13 


1.0 


60 


1.0 


121 


IS 


8 


18.6 


20 


23.0 


104 


26.3 


209 


NIS 


8 


7.8 


600 


17.4 


1460 


5.4 


4020 


NIS+/- 


8 


7.5 


572 


30.2 


1290 


37.4 


2170 



Table 1: Simulation results for Example 1. All figures are computed/averaged over 100 independent runs. 



Method 


N = 1,000 
Time (ms) 


N = 5,000 
Time (ms) 


TV = 10,000 
Time (ms) 


MC 


1.8 


9.0 


17.8 


CDIS 


6.0 


27.8 


54.5 


NIS 


13.7 


29.2 


48.9 


NSIS 


14.1 


31.1 


52.1 



Table 2: CPU times for the option pricing example (Example 2) averaged over 1,000 independent runs. 







'Pi 




'P2 




Method 


N 


a = .75 


a = 3.5 


a =.75 


a = 3.5 


NSIS 


1,250 


1.59 


2.89 


0.58 


3.82 


GAIS 


1,250 


0.02 


3.45 


0.30 


1.11 


LAIS 


1,250 


0.75 


0.99 


1.92 


0.58 


NSIS 


5,000 


8.08 


4.50 


9.21 


5.09 


GAIS 


5,000 


5.88 


0.67 


0.96 


0.36 


LAIS 


5,000 


3.45 


1.30 


2.63 


0.42 


NSIS 


10,000 


9.38 


4.75 


11.06 


5.77 



Table 3: Relative efficiency of NSIS, GAIS, and LAIS compared to SIS for Example 3. Figures for NSIS 
arc computed over 1,000 independent runs. Figures for GAIS and LAIS are reprocessed from Table 2 in 
Givens and Raftery (1996). 





K = 5 




K=10 




K =20 


K = 30 


Method 


RE 


CV 


RE 


CV 


RE CV 


RE CV 


MC 


1.0 


.001 


1.0 


.14 






IS 


.3 


.01 


19.8 


.03 


.08 


.24 


NIS 


.2 


.02 


58.4 


.02 


.03 


.09 



Table 4: Results for the spam filter queueing application (single server case). Relative efficiency and 
coefficient of variation for the estimates of the probability that the queue length reaches level K. All figures 
are computed over 100 independent runs with 1 Mio. busy periods in each run. 
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K = 4 




K = 6 




K = 8 




Method 


RE 


CV 


RE 


CV 


RE 


CV 


MC 


1.0 


.007 


1.0 


.044 


1.0 


.34 


IS 


3.7 


.003 


7.0 


.017 


53.5 


.046 


NIS 


2.6 


.004 


24.3 


.009 


184.6 


.025 



Table 5: Results for the spam fihcr queueing apphcation (dual server case). Relative efRciency and 
coefficient of variation for the estimates of the probability that the queue length reaches level K. All figures 
are computed over 100 independent runs with 1 Mio. busy periods in each run. 
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Figure 1: Computational efficiency (measured by MSE x computation time) of MC (solid line), IS (dashed 
line), and NIS+/- (heavy line) for N = 1,000 (left), and N = 10,000 (right) for Example 1. All figures are 
computed/ averaged over 100 independent runs. 
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Figure 2: Convergence of NIS variance towards optimal IS variance for d = 1 (left) and d = 4 (right) for 
Example 1. All figures are computed over 10,000 independent runs. 
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Figure 3: Relative efficiency of GDIS (dotted line), NIS (heavy line), NSIS (dashed line), and crude MC 
(solid line) for Example 2 (strike Ki (left), strike K2 (right)) and 1,000 < N < 10,000. All figures are 
computed over 1,000 independent runs. 




Figure 4: Standard normal distribution (dashed line), optimally shifted normal distribution (dotted line), 
linear blend frequency polygon estimates {N = 5,000) of the optimal proposals (solid line), and gjf 
(heavy line) for Example 2. 
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Figure 5: Example 3: The upper plots are for the case a = .75 and the lower plots for a = 3.5. From left 
to right we have density p{xi,X2) and the optimal proposals and q^^- 
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Figure 6: Spam filter application: Histogram of the empirical interarrival times and Exponential distribu- 
tion with parameter .074 (left). Linear blend frequency polygon estimates of the service time distribution 
(solid line) and of the optimal proposal g°P* for the single server (dotted line) and dual server (dashed line) 
case for K = 10 (right). 
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Figure 7: Results for spam filter application: Estimated probabilities of the queue length to reach level K 
for single server (heavy line) and dual server (dashed line) case. 
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